1. Identity statement
Reference Type: Conference Paper (Conference Proceedings)
Site: sibgrapi.sid.inpe.br
Holder Code: ibi 8JMKD3MGPEW34M/46T9EHH
Identifier: 8JMKD3MGPAW/3RP2P48
Repository: sid.inpe.br/sibgrapi/2018/09.02.02.43
Last Update: 2018:09.02.11.29.09 (UTC) administrator
Metadata Repository: sid.inpe.br/sibgrapi/2018/09.02.02.43.26
Metadata Last Update: 2022:06.14.00.09.19 (UTC) administrator
DOI: 10.1109/SIBGRAPI.2018.00061
Citation Key: MaiaJulcHira:2018:MaLeAp
Title: A Machine Learning approach for Graph-based Page Segmentation
Format: On-line
Year: 2018
Access Date: 2024, Apr. 27
Number of Files: 1
Size: 3626 KiB
2. Context
Author: 1 Maia, Ana Lucia Lima Marreiros
2 Julca-Aguilar, Frank Dennis
3 Hirata, Nina Sumiko Tomita
Affiliation: 1 University of São Paulo/State University of Feira de Santana
2 University of São Paulo
3 University of São Paulo
Editor: Ross, Arun
Gastal, Eduardo S. L.
Jorge, Joaquim A.
Queiroz, Ricardo L. de
Minetto, Rodrigo
Sarkar, Sudeep
Papa, João Paulo
Oliveira, Manuel M.
Arbeláez, Pablo
Mery, Domingo
Oliveira, Maria Cristina Ferreira de
Spina, Thiago Vallin
Mendes, Caroline Mazetto
Costa, Henrique Sérgio Gutierrez
Mejail, Marta Estela
Geus, Klaus de
Scheer, Sergio
e-Mail Address: anamaia@ime.usp.br
Conference Name: Conference on Graphics, Patterns and Images, 31 (SIBGRAPI)
Conference Location: Foz do Iguaçu, PR, Brazil
Date: 29 Oct.-1 Nov. 2018
Publisher: IEEE Computer Society
Publisher City: Los Alamitos
Book Title: Proceedings
Tertiary Type: Full Paper
History (UTC): 2018-09-02 11:29:09 :: anamaia@ime.usp.br -> administrator :: 2018
2022-06-14 00:09:19 :: administrator -> :: 2018
3. Content and structure
Is the master or a copy? is the master
Content Stage: completed
Transferable: 1
Version Type: finaldraft
Keywords: Page segmentation
document image
machine learning
graph
connected components classification
convolutional neural network
Abstract: We propose a new approach for segmenting a document image into its page components (e.g., text, graphics, and tables). Our approach consists of two main steps. In the first step, a set of scores corresponding to the output of a convolutional neural network, one for each of the possible page component categories, is assigned to each connected component in the document. The labeled connected components define a fuzzy over-segmentation of the page. In the second step, spatially close connected components that are likely to belong to the same page component are grouped together. This is done by building an attributed region adjacency graph of the connected components and modeling the problem as an edge removal problem. Edges are then kept or removed based on a pre-trained classifier. The resulting groups, defined by the connected subgraphs, correspond to the detected page components. We evaluate our method on the ICDAR2009 dataset. Results show that our method effectively segments pages, being able to detect the nine types of page components. Furthermore, as our approach is based on simple machine learning models and graph-based techniques, it should be easy to adapt to the segmentation of a variety of document types.
Arrangement 1: urlib.net > SDLA > Fonds > SIBGRAPI 2018 > A Machine Learning...
Arrangement 2: urlib.net > SDLA > Fonds > Full Index > A Machine Learning...
doc Directory Content: access
source Directory Content
FInal_PaperID_50.pdf 01/09/2018 23:43 3.5 MiB
agreement Directory Content
agreement.html 01/09/2018 23:43 1.2 KiB 
4. Conditions of access and use
data URL: http://urlib.net/ibi/8JMKD3MGPAW/3RP2P48
zipped data URL: http://urlib.net/zip/8JMKD3MGPAW/3RP2P48
Language: en
Target File: Final_PaperID_50.pdf
User Group: anamaia@ime.usp.br
Visibility: shown
Update Permission: not transferred
5. Allied materials
Mirror Repository: sid.inpe.br/banon/2001/03.30.15.38.24
Next Higher Units: 8JMKD3MGPAW/3RPADUS
8JMKD3MGPEW34M/4742MCS
Citing Item List: sid.inpe.br/sibgrapi/2018/09.03.20.37 9
Host Collection: sid.inpe.br/banon/2001/03.30.15.38
6. Notes
Empty Fields: archivingpolicy archivist area callnumber contenttype copyholder copyright creatorhistory descriptionlevel dissemination edition electronicmailaddress group isbn issn label lineage mark nextedition notes numberofvolumes orcid organization pages parameterlist parentrepositories previousedition previouslowerunit progress project readergroup readpermission resumeid rightsholder schedulinginformation secondarydate secondarykey secondarymark secondarytype serieseditor session shorttitle sponsor subject tertiarymark type url volume

